1 Introduction

This program is developed based on the Shiny framework, a set of R packages and a collection of scripts written by members of Junhyong Kim Lab at University of Pennsylvania. Its goal is to facilitate fast and interactive RNA-Seq data analysis and visualization. Current version of PIVOT supports routine RNA-Seq data analysis including normalization, differential expression analysis, dimension reduction, correlation analysis, clustering and classification. Users can complete workflows of DESeq2, monocle and scde package with just a few button clicks. All analysis reports can be exported, and the program state can be saved, loaded and shared.

See http://kim.bio.upenn.edu/software/pivot.shtml for more details.

2 Launch PIVOT

  • To run PIVOT, in Rstudio console, use command

    library(PIVOT)
    pivot()
  • For advanced users, if you want to only load needed modules,

Then you can either use

pivot_module()

which shows the available modules in PIVOT:

ID Module
1 PIVOT.analysis
2 DESeq2
3 edgeR
4 scde
5 monocle
6 PIVOT.network
7 caret
8 PIVOT.toolkit

Then use pivot(#ID_vector) to launch selected modules, e.g., pivot(c(1,2,3)) to launch PIVOT with the base PIVOT module, DESeq2 and edgeR.

Alternatively, use

pivot_launcher()

to launch a window to directly pick modules or install required components.

3 Data Input

3.1 Input expression matrix

  • To input expression matrix, select “Counts Table” as input file type. PIVOT expects the count matrix to have rows as genes and samples as columns. Gene names and sample names should be the first column and the first row, respectively.

  • PIVOT support expression matrix in csv, txt, xls or xlsx formats. Choose proper settings on the left file input panel until the right “Loaded File Preview” correctly shows the data frame.

  • you need to make sure that the data matrix:

    • Contains no NA or non-numeric values.

    • Does not have duplicated feature or sample names (PIVOT will alert the user if it detects any).

3.2 Input counts directory

  • If you have counts files with genes in the first column and gene counts in the second column. You can input the entire data folder. PIVOT will automatically combine the counts files into an expression matrix. Note you need to make sure that the files are all quantified using the same pipeline with the same annotation file.

  • PIVOT also allows you to select subset of files in the folder. E.g., type ‘exon’ in the filter textbox to only select files with “exon” in the file name.

3.3 Input 10x Cell Ranger output

3.4 Data Normalization

We have included a multitude of normalization methods in PIVOT. Some normalization methods are mostly used for bulk RNA-Seq data, while others may only be applicable to single-cell data. Users should be aware of which method best suits their data.

  • If your data has already been processed by DESeq or other methods, please specify “none” in the normalization method.

  • If DESeq failed on your data, one possibility is that you have low counts samples, which leads to all the genes contain at least one 0 in the counts matrix. You can either find out and remove these samples, or choose the “modified DESeq” normalization method. Setting sample inclusion 100% is equivalent to the original DESeq.

  • Some normalization method require additional information from the user. For example, if you choose ERCC normalization then you must provide experimental parameters. Please also make sure the ERCC feature names matches those in the standard table: (https://tools.thermofisher.com/content/sfs/manuals/cms_095046.txt).

  • PIVOT applies a pre-filtering step before doing normalization. By default, PIVOT will filter out genes with all 0 expressions. Users can also specify a different row mean or row sum threshold to remove those low confidence features.

  • Once data have been normalized, you can check the normlization details which contain information such as the estimated size factors.

3.5 Input Design Information

  • The design infomation are used for sample point coloring and differential expression analysis. Users can input the entire sample meta sheet as design information, or manually specify groups or batches for each sample.

  • The first column of the design table should always be the sample name list, which must include all samples that’s in the expression matrix. The rest columns will be treated as “categories” or “design variables”, which can be “condition”, “batch”, “operator”, “experiment date”, etc. You will be able to choose which category to be used for analysis such as DE, as well as if the category should be treated as categorical or numerical.

  • You can also manually make a design-info file by specifying the sample grouping in PIVOT, and download it for later upload.

  • Empty or NA entries are allowed in design table. But You need to make sure the categories used for DE testing (conditions/time points/batches) do not contain NAs.

  • The design categories will be used as “meta” info for sample coloring in many modules, such as PCA or heatmap.

3.6 Load Example Dataset

  • We have included in PIVOT an example dataset from Dueck, Hannah, et al., 2015. This data consists of a subset of cells from 5 different cell types (brown adipocytes (BA), cardiomyocytes (CM), cortical pyramidal neurons (CPN) and hippocampal pyramidal neuron (HPN)). User can press the question mark button on the top right corner of PIVOT and the “Load Example” button to load this example dataset. Results of many modules have already been pre-computed for users to explore all features of PIVOT.

4 Data Manipulation

4.1 Feature ID/Name Conversion

PIVOT supports automatic ID/Name conversion and supports most of the ID/Names listed in the BioMart database. Despite this powerful feature, we still recommend users to get the feature name right in the first place (i.e., use proper gene id/name output when doing the count quantification), as id/name conversion could inevitably lead to id or name duplicates, and unmapped features.

  • For features that has no match in the database, PIVOT will keep its original ID.
  • For features that map to multiple entries in the database, PIVOT will use the first matched entry.
  • For multiple features that map to one entry in the database, PIVOT will name the duplicates as ‘name.1’, ‘name.2’,etc.

4.2 Feature Filtering

  • There are currently 3 types of feature filter in PIVOT: the expression filter, which filters based on various expression statistics; the feature list filter, which filters based on user input gene list; and the P-value filter, which filters data with differentially expressed genes.

Expression Filter: PIVOT incorperates the scater package to provide users various QC plots for feature filtering. You can choose any of the stats and select range to only keep those features within the specified range.

  • Filter can be either positive (selecting the genes that satisfy the condition) or negative (delete the genes).

Feature List Filter: You can provide a marker feature list to get a marker feature expression subset.

  • The marker features should appear as the first column in your file.

  • Some of the features in your marker list may not be found in the dataset, because they may have already been removed due to 0 expression in all your samples.

P-Value Filter:

  • Once users have performed DE analysis, PIVOT will automatically load the results to the P-value filter to allow users to filter the data with significant DEGs. You can change the P value threshold, as well as the direction of log fold change.

Removing Filtering Effects

  • Applying will return the data to the unfiltered state, i.e., if you have performed sample subsetting prior to filtering, the dataset will return to the sample subset.

4.3 Sample Subsetting

  • Subsetter allows you to choose a subset of samples for analysis. You can either manually select samples, groups, upload a sample list or subset based on sample statistics.

  • To filter based on sample statitics, you can directly drag on the sample stats plot to specify a range.

  • You can choose whether or not the subset should be renormalized. You can change the re-normalization method and associated parameters in the “FILE” panel.

  • An implicit filtering will occur to get nonzero count genes for the subset. This procedure prevents some downstream analysis from breaking on 0s.

  • Applying will return the data to the original input dataset, which means all filtering or subsetting effects will be removed.

4.4 Data Map

  • With each data filtering/subsetting operation, you are creating data subsets whose lineage can be tracked using the data map. Mouse over each edge in the map will show you the operation details, and you can switch between data subsets, rename nodes, delete subsets or add notes by simply selecting the nodes and click buttons.

  • To attach analysis to the nodes, simply click the magnet button on the top right corner of each analysis box. The rest of the buttons are for pasteing R markdown reports to the report module, generating HTML reports and box collapsing.

5 Analysis Modules

5.1 Data Scale

  • For most analysis modules, you can choose one of the four data scales:

    • counts (normalized) : DESeq normalized counts;

    • Log10 : log10 (normalized_counts + 1). Plus one to include zeros;

    • Standardized : Standardization (calculate Z-scores) is performed on the DESeq normalized counts;

    • Log10 & Standardized : Standardization (calculate Z-scores) is performed on log10(normalized_counts + 1), assuming log-normal distribution.

  • For each individual analysis, please choose the most proper data scale. Some modules have fixed data scale choice (e.g., raw counts input for DESeq differential analysis) so this option is not available.

5.2 Data Table

  • You can download data table with different data scales and ordering. If your original data is multiple counts files in a folder, you can also download the combined raw count table or normalized table for single file input.

    alt text

    alt text

  • The relative frequency of a gene is defined as its raw count divided by the total counts of the sample.

5.3 Basic Data Statistics

  • This module contains simple statistics of your data, including some useful plots and table for quality control.

5.4 ERCC

  • If your data contains ERCC spike-in, PIVOT will plot the ERCC read distribution and the estimated molecules based the standard table (https://tools.thermofisher.com/content/sfs/manuals/cms_095046.txt).

  • The default parameter is 0.9 µL ERCC with 1:4000000 dilution added per sample. The user will need to adjust the parameters according to the protocol used.

    alt text

    alt text

5.5 Differential Expression Analysis

5.5.1 DESeq Differential Expression Analysis

5.5.2 SCDE

  • This module is a graphical interface for the SCDE package (http://hms-dbmi.github.io/scde/diffexp.html).

  • The SCDE error modeling must be performed first before you can use other SCDE analysis. For large dataset the modeling process can be very slow. You can monitor the progress in the background R session.

  • You can use SCDE distance for hierarchical clustering and minimum-spanning-tree generation. There are three types of adjustment method you can choose: direct drop-out, reciprocal weighting and mode relative weighting. For details of these methods please check the SCDE website. Once a distance has been computed, it is loaded into PIVOT to be used in other modules.

    alt text

    alt text

5.5.3 Mann-Whitney Test

  • Also known as wilcoxon rank sum test. The null hypothesis is that the distributions of the gene expression in the two groups has no difference and the alternative is that they differ by some non-zero location shift.

  • You can choose the P adjustment method. The default method is Bonferroni correction.

5.5.4 Monocle Differential Expression Analysis

  • This module is a graphical interface for the Monocle package (http://cole-trapnell-lab.github.io/monocle-release/).

  • This analysis is most proper for groups that represent the progress of a biological process, such as time of cell collection, cell state or media change. For details of this analysis please check the monocle paper and website.

5.6 Monocle Cell State Ordering

  • Currently you can only use less than 1000 features for cell state ordering. It is recommended that you use a marker feature list, the top genes ranked by overdispersion or the significant differentially expressed genes for ordering.

  • If you have provided group information, the comparison between monocle assigned state and your group will be available.

  • You can click the gene list to view the expression level of the selected gene plotted on the graph or as a function of pseudotime.

5.7 Correlation Analysis

  • You can generate pairwise scatterplot if you have less than 50 samples. This module will fail if you have too many samples.

  • The sample correlation heatmap provides a more intuitive way of visualizing the correlation between your samples. If you specifies color by group, a color bar will be added to the heatmap to show the group info. You can only use the correlation distance for hierarchical clustering in a correlation heatmap.

5.8 Feature Heatmap

  • You can rank features by fano factor, variance or mean expression. You are not allowed to use standardized data for ranking because all feature will have the same mean (0) and variance (1). However, if you want to plot in standardized scale, you can choose rank by “Row order”. alt text

  • The ranking of the heatmap is used for choosing the top ranked features to be plotted in the heatmap. You can use the slider to choose the features to be plotted. By default if the number of features is greater than 500, only the top 500 features will be plotted. Using the manual input, you can choose any rank range other than the top 500. The only limit is that you can only plot 1000 features at a time.

    alt text

    alt text

  • By default, the order of the samples (columns) and features (rows) shown in the heatmap will be determined by hierarchical clustering. Alternatively, if “Do not cluster sample/feature” is specifed, the samples/features will be ordered by the rank. If the user also specifies “rank by row order”, then the heatmap order will be exactly the same as the input row/column order.

    alt text

    alt text

  • d3heatmap package allows zoom in (click and drag) and zoom out (double click).

  • Press alt text to try different colors!

5.9 Dimension Reduction

5.9.1 PCA

  • You can look at any principal component. Percentage shown on each axis is the percent variance explained by that principal component.

    alt text

    alt text

5.9.2 T-SNE

  • 1D, 2D and 3D T-SNE are results of 3 different t-SNE processes (dims = 1, 2 or 3).

  • According to http://lvdmaaten.github.io/tsne/,

    “Perplexity is a measure for information that is defined as 2 to the power of the Shannon entropy. The perplexity of a fair die with k sides is equal to k. In t-SNE, the perplexity may be viewed as a knob that sets the number of effective nearest neighbors. It is comparable with the number of nearest neighbors k that is employed in many manifold learners.”

    alt text

    alt text

5.10 Classification

  • This module is a graphical interface for the caret package (http://topepo.github.io/caret/index.html). Currently it allows the user to choose almost all classification models listed in ( http://topepo.github.io/caret/modelList.html).

  • Some methods may require new packages to be installed. In such cases, the background R session will ask you to install it. Choose yes if you want to proceed.

    alt text

    alt text

  • Users are expected to have the knowledge of which model is most suitable for their data. The details of these models can be found in the caret website.

  • Many methods contains built-in feature selection. For those having explicit coefficients for the features, the feature coefficient table will be available for download.

  • You can specify the cross-validation parameters to be used in the training.

    alt text

    alt text

  • For methods containing internal feature selection and explicit feature coefficients, the information will be available in the “Feature Coefficient” table.

  • The trained model can be used as a classifier for new data, or be used for the testing of the model. For example, if you train the classifier using data of group A and group B, and then perform testing on group A-t and group B-t. Then samples in group A-t and B-t will be classified as group A or group B, so that you can tell how well the model is performing. Alternatively, you can use the model to classify the unknow samples in e.g. group C.

6 Report Module

  • The code has been pre-written for you – you only need to add it. Please always run the module before you add it to report, because PIVOT requires the module has a last state to capture all the parameters.

    alt text

    alt text

  • The report code is based on R-markdown v2 (http://rmarkdown.rstudio.com/). You can add comments, change titles or modify the code (not recommended) according to the syntax.

7 System Control

  • The system control menu is located at the top right corner. Users can use this panel to save the program state, launch new session or go to data management panels.

    alt text

    alt text

7.1 Program State Saving and Loading

  • You can save the program state as an R data object. In this way you can make sure that you won’t lose your analysis progress, and you can share the state with others.

  • To load the saved state, go to File panel and choose PIVOT state in input file type. The session will auto refresh and immediately switch to the loaded state when the state uploading is complete.

    alt text

    alt text

  • Please note that although every analysis result will be kept in the state, PIVOT is not able to return the parameter choices to the ones you chose when the analysis was performed. In other words, the results and plots shown in the loaded state will stay the same, but the new parameter UI will return to the defaults.

  • Each time you exit, PIVOT will automatically save the state into the background R session. If you don’t close the R session, or save the workspace image before exiting R, the next time you launch PIVOT it will automatically load the state for you.

  • To clear the state, click alt text.

7.2 Launch New Session

  • Clicking the alt text button will launch new PIVOT session for you. In this way you can do different things in different sessions. There is a limitation for this: because R is single-thread, you can only perform one analysis at a time. While R is busy computing in one session, the other sessions will have to wait in a queue.

  • If you really want to simultaneously perform multiple analysis in multiple sessions, currently the only way is to open multiple copies of R, and launch PIVOT using the command “pivot()” in each R session.

8 Other Useful Information

8.1 Gene Expression Plot

  • In all differential expression analysis modules, you can click the result table to view the expression plot of individual genes.

    alt text

    alt text

8.2 Size of the Plot

  • You can change the width of most plots by resizing the window.

  • You can resize ggvis plots by using the bottom right triangle, and download the plot by clicking the cog icon on top right.

    alt text

    alt text

8.3 Citation Infomation

  • The citation and licensing information can be found at the bottom of each module.

    alt text

    alt text

8.4 Built-in Help Infomation

  • You can find alt text in many places in PIVOT. You can click it to view the relavant information of the module.

  • Many parameter inputs has tooltips containing the relavant info.

    alt text

    alt text